Presented by - Madhurika Situt
• The Global Financial Development Database is an extensive dataset of financial system characteristics for 214 economies.
• The World Development Indicators dataset presents the most current and accurate global development data, and includes national, regional and global estimates.
• The Statistics Division of the Food and Agriculture Organization of the United Nations annually collects data on Government Expenditure on Agriculture through a questionnaire developed in partnership with the International Monetary Fund.
• Global Financial Development dataset - This dataset is downloaded from https://datacatalog.worldbank.org/dataset/global-financial-development . It contains financial data for 203 economies from 1960 to 2017. The database includes measures of:
• World Development Indicators dataset- This data set is downloaded from https://datacatalog.worldbank.org/dataset/world-development-indicators .
The primary World Bank collection of development indicators compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
This dataset contains data for 217 economies from 1960 to 2019.
The database includes measures of Agriculture and Food Security, Climate Change, Economic Growth, Education, Energy and Extractives, Environment and Natural Resources, Financial Sector Development, Gender, Health, Nutrition and Population, etc.
Format: CSV files
• Government Expenditure dataset- This data set is downloaded from http://www.fao.org/faostat/en/#data/IG/metadata .
The Statistics Division of FAO annually collects data on Government Expenditure on Agriculture through a questionnaire developed in partnership with the International Monetary Fund. The IMF is the institution responsible for the Government Finance Statistics (GFS) methodology and annually collects GFS data, including Expenditure by Functions of Government (COFOG). The Classification of the Functions of Government (COFOG) is an international classification developed by the Organisation for Economic Co-operation and Development (OECD) and published by the United Nations Statistics Division (UNSD), with the aim of categorizing government functions according to their purposes.
Time coverage: 2001-2018. Columns: 11. Rows: around 60,000.
Which top 3 countries have the highest investment in agriculture, and what are their returns?
Which top 3 countries have the highest returns in the agriculture sector, and in which year?
For the analysis below, I create a function main() that takes the years for analysis and the list of countries as user input; all results are exported in PDF format to a results.pdf file for future reference.
#Loading libraries
import pandas as pd
import numpy as np
import openpyxl
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import string
import PyPDF2
import os
pd.set_option('display.float_format', lambda x: '%.5f' % x)
from plotly.subplots import make_subplots
from PyPDF2 import PdfFileWriter, PdfFileReader, PdfFileMerger
import glob
import sys
#Note: in a fresh environment, run this install before the imports above
!pip install PyPDF2
#Loading dataset using openpyxl to get sheetnames
wb = openpyxl.load_workbook('GFDD.xlsx')
wb.sheetnames
#Reading each sheet into its own dataframe
xls = pd.ExcelFile('GFDD.xlsx')
df_data = pd.read_excel(xls, 'Data')
#Printing the number of rows and columns
print("Rows and Columns\n")
print("\ndf_data shape :", df_data.shape)
#Printing the column names
print("Column names\n")
print("\ndf_data columns :", df_data.columns)
#Extracting the rows whose indicator name is 'Gross National Income (current US$)'
df_GNI=pd.DataFrame(df_data.loc[df_data['Indicator Name'].isin(['Gross National Income (current US$)'])])
#resetting index
df_GNI=df_GNI.reset_index(drop=True)
print(df_GNI.shape)
df_GNI.head()
#Checking dataframe info
print("df_data info-\n", df_GNI.info())
We don't need the 'Country Code', 'Indicator Name' and 'Indicator Code' columns, so we drop them.
#Dropping unwanted columns
print("Columns which we are dropping:\n",df_GNI.columns[range(1,4)])
df_GNI=df_GNI.drop(df_GNI.columns[range(1,4)],axis=1)
df_GNI.columns
#Dropping rows that are completely null across the year columns (1960 to 2017)
temp_li=df_GNI.columns[range(1,59)]   #year columns start at index 1 after the drop above
df_GNI=df_GNI.dropna(how='all',subset=temp_li)
print("Number of null rows\n",(df_GNI.T).isnull().sum())
#Checking for null values in df_GNI
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_GNI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Computing each row's minimum across the year columns (used below to fill nulls)
list_GFDD_min=list(df_GNI[df_GNI.columns[range(1,59)]].min(axis=1))
list_GFDD_min=[ '%.2f' % elem for elem in list_GFDD_min]
#Filling null values with the corresponding row-wise minimum
for i in range(df_GNI.shape[0]):
    df_GNI.iloc[i]=df_GNI.iloc[i].fillna(list_GFDD_min[i])
#Checking for null values in df_GNI
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_GNI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
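The row-wise fill above (a Python loop over rows) can also be done vectorized: transpose, fill column-wise with the row minimums, then transpose back. A minimal sketch on a toy frame with made-up year columns:

```python
import pandas as pd

# Toy frame with made-up year columns: fill each row's NaNs with that
# row's minimum, vectorized (transpose, fill column-wise, transpose back).
df_toy = pd.DataFrame({"2001": [1.0, None], "2002": [None, 8.0], "2003": [3.0, 2.0]})
filled = df_toy.T.fillna(df_toy.min(axis=1)).T
print(filled)
```

This keeps the values as floats instead of the '%.2f' strings produced by the loop above, so the later astype(float) cast becomes unnecessary.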
#Removing spaces from country names and normalizing capitalization
for i in range(df_GNI.shape[0]):
    df_GNI['Country Name'].iloc[i]=df_GNI['Country Name'].iloc[i].replace(' ','')
    df_GNI['Country Name'].iloc[i]=df_GNI['Country Name'].iloc[i].capitalize()
#Setting Country name as index
df_GNI=df_GNI.set_index(df_GNI.columns[0])
df_GNI.index
#Finding the countries with the maximum GNI from 1960 to 2017
# Changing datatype from object to float
for i in range(df_GNI.shape[1]):
    df_GNI[df_GNI.columns[i]]=df_GNI[df_GNI.columns[i]].astype(float)
#list_max_GNI=list(df_data_GNI.iloc[:, 3:25].max(axis=1))
df_GNI['Max_GNI']=list(df_GNI.iloc[:, 0:58].max(axis=1))
df_GNI.head()
#plotting in bar graph
fig = go.Figure(data=[
go.Bar(name='Maximum GNI', x=df_GNI.index, y=df_GNI.Max_GNI)
])
# Change the bar mode
fig.update_layout(barmode='group',title_text='Maximum gross national income across all economies',autosize=False,width=1000,height=600)
fig.show()
#Loading datasets
df_GI=pd.read_csv('Investment_GovernmentExpenditure_E_All_Data_(Normalized).csv',encoding='latin-1')
df_GI
#Checking df_GI info
df_GI.info()
#Extracting the rows whose item name is in the list below
df_GI=pd.DataFrame(df_GI.loc[df_GI['Item'].isin(['Agriculture, forestry, fishing (General Government)',
'Agriculture, forestry, fishing, Recurrent (General Government)',
'Agriculture, forestry, fishing, Capital (General Government)',
'Agriculture (General Government)',
'Agriculture, Recurrent (General Government)',
'Agriculture, Capital (General Government)',
'R&D Agriculture, forestry, fishing (General Government)',
'Agriculture, forestry, fishing, Recurrent (Central Government)',
'Agriculture, forestry, fishing, Capital (Central Government)',
'Agriculture (Central Government)',
'Agriculture, Recurrent (Central Government)',
'Agriculture, Capital (Central Government)',
'R&D Agriculture, forestry, fishing (Central Government)'
])])
#resetting index
df_GI=df_GI.reset_index(drop=True)
print(df_GI.shape)
df_GI
#df_GI columns
df_GI.columns
We just need the 'Area', 'Year', 'Item' and 'Value' columns for our analysis, so we drop the rest.
# Removing unwanted columns
list_GI_columns=['Area Code', 'Item Code', 'Element Code', 'Element','Year Code', 'Unit', 'Flag', 'Note']
print("Columns which we are dropping:\n",df_GI[list_GI_columns].columns)
df_GI=df_GI.drop(df_GI[list_GI_columns].columns,axis=1)
df_GI.columns
#Grouping by Area and Year and adding all the values for different items
df_GI=df_GI.groupby(['Area','Year'])['Value'].sum().to_frame().reset_index()
df_GI
# Creating unique list of year and country
list_GI_year=list(range(2001,2018)) # we are just taking years till 2017
# optionally restrict to the top 5 countries with maximum GNI (filter commented out below)
#list_GI_country=['United States of America','China, mainland','Japan','Germany','United Kingdom' ]
list_GI_country=list(df_GI.Area.unique())
#sorting list
list_GI_year.sort()
list_GI_country
#list_GI_year
#Checking for null values in df_GI
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_GI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Creating a draft dataframe indexed by year with one column per country
df_GI_v1=pd.DataFrame(columns=list_GI_country, index=list_GI_year)
df_GI_v1.head()
#Filling the draft dataframe df_GI_v1 with values from df_GI
for k in range(df_GI.shape[0]):
    for i in range(len(list_GI_year)):
        for j in range(len(list_GI_country)):
            if df_GI['Year'].iloc[k]== list_GI_year[i] and df_GI['Area'].iloc[k] == list_GI_country[j]:
                df_GI_v1[df_GI_v1.columns[j]].iloc[i]=df_GI['Value'].iloc[k]
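The triple nested loop above scans every row once per (year, country) pair; pandas can build the same year-by-country table with a single pivot. A sketch on a small toy frame with the same 'Area'/'Year'/'Value' layout (the values are made up):

```python
import pandas as pd

# Toy long-format frame shaped like df_GI after the groupby (values made up).
long_df = pd.DataFrame({
    "Area": ["Brazil", "Brazil", "India"],
    "Year": [2001, 2002, 2001],
    "Value": [10.0, 12.0, 7.0],
})

# A single pivot builds the same year-by-country table as the nested loops,
# leaving NaN where a (year, country) pair has no data.
wide = long_df.pivot(index="Year", columns="Area", values="Value")
print(wide)
```

On the real data this replaces the O(rows × years × countries) loop with one call.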
#df_GI_v1 columns and rows
df_GI_v1.shape
#df_GI_v1 info
df_GI_v1.info()
#Checking for null values in df_GI_v1
df_GI_v1.isnull().sum()
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_GI_v1.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Dropping columns that are completely null
df_GI_v1=df_GI_v1.dropna(axis='columns', how='all')
#Filling remaining null values with the column-wise minimum
df_GI_v1=df_GI_v1.fillna(df_GI_v1.min())
df_GI_v1.shape
#Checking for null values in df_GI_v1
df_GI_v1.isnull().sum()
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_GI_v1.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Cleaning the country-name strings used as df_GI_v1 columns
df_GI_v1_country_list=list(df_GI_v1.columns)
for i in range(len(df_GI_v1_country_list)):
    df_GI_v1_country_list[i]=df_GI_v1_country_list[i].replace(' ','')
    df_GI_v1_country_list[i]=df_GI_v1_country_list[i].capitalize()
df_GI_v1.columns=df_GI_v1_country_list
df_GI_v1.columns
df_GI_v1.head()
#Optionally export df_GI_v1 to an Excel sheet
#output = pd.ExcelWriter('df_GI_v1.xlsx')
#df_GI_v1.to_excel(output, sheet_name='sheet1')
#output.save()
#Plotting government expenditure on agriculture in plotly
fig = go.Figure()
for i in range(df_GI_v1.shape[1]):
    fig.add_trace(go.Scatter(x=df_GI_v1.index, y=df_GI_v1[df_GI_v1.columns[i]], name=df_GI_v1.columns[i]))
#fig.add_trace(go.Scatter(x=df_GI_v1.index, y=df_GI_v1['Brazil'], name=df_GI_v1.columns[0]))
fig.update_layout(title_text='Government expenditure on Agriculture sector for different economies')
fig.show()
#Loading datasets
df_WDI=pd.read_csv('WDI_csv/WDIData.csv')
df_WDI
Here we keep only the specific columns and the indicator 'Agriculture, forestry, and fishing, value added (current US$)' for our analysis.
#Extracting the rows whose indicator name is 'Agriculture, forestry, and fishing, value added (current US$)'
df_WDI=pd.DataFrame(df_WDI.loc[df_WDI['Indicator Name'].isin(['Agriculture, forestry, and fishing, value added (current US$)'])])
#resetting index
df_WDI=df_WDI.reset_index(drop=True)
print(df_WDI.shape)
df_WDI.columns
# Removing unwanted columns
df_WDI_columns=['Country Code','Indicator Name','Indicator Code','2018','2019']
print("Columns which we are dropping:\n",df_WDI_columns)
df_WDI=df_WDI.drop(df_WDI_columns,axis=1)
df_WDI.columns
#Dropping rows that are completely null across the year columns
temp_WDI_list=df_WDI.columns[range(1,59)]
df_WDI=df_WDI.dropna(how='all',subset=temp_WDI_list)
#df_WDI information
df_WDI.info()
#Checking for null values in df_WDI
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_WDI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Computing each row's minimum across the year columns (used below to fill nulls)
df_WDI_min=list(df_WDI[df_WDI.columns[range(1,59)]].min(axis=1))
df_WDI_min=[ '%.2f' % elem for elem in df_WDI_min]
#Filling null values with the corresponding row-wise minimum
for i in range(df_WDI.shape[0]):
    df_WDI.iloc[i]=df_WDI.iloc[i].fillna(df_WDI_min[i])
#Removing spaces from country names and normalizing capitalization
for i in range(df_WDI.shape[0]):
    df_WDI['Country Name'].iloc[i]=df_WDI['Country Name'].iloc[i].replace(' ','')
    df_WDI['Country Name'].iloc[i]=df_WDI['Country Name'].iloc[i].capitalize()
# Checking for null values in df_WDI
fig, ax = plt.subplots(figsize=(20,5)) # Sample figsize in inches
sns.heatmap(df_WDI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
#Extracting the country lists from df_GI_v1, df_GNI and df_WDI
GFDD_country_list=list(df_GNI.index)
GI_country_list=list(df_GI_v1.columns)
WDI_country_list=list(df_WDI['Country Name'].unique())
#Finding the countries common to all three datasets
country_list=list(set(WDI_country_list) & set(GI_country_list) & set(GFDD_country_list))
country_list.sort()
print(country_list)
len(country_list)
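The common-country filter above is a plain set intersection; a minimal sketch with made-up country names:

```python
# Toy country lists (made up) showing the same intersection pattern.
WDI = {"Brazil", "India", "France"}
GI = {"India", "France", "Chad"}
GFDD = {"France", "India", "Japan"}

common = sorted(WDI & GI & GFDD)   # countries present in all three sets
print(common)
```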
Updating all three dataframes to keep only the rows for the countries in country_list.
#Keeping only the rows of each dataframe that belong to countries in country_list
# for df_GNI
df_GNI=(pd.DataFrame(df_GNI.loc[df_GNI.index.isin(country_list)])).T
print(df_GNI.shape)
df_GNI.head()
df_GNI=df_GNI.drop(index='Max_GNI')
# converting datatype of index to numbers
df_GNI.index=df_GNI.index.astype(int)
df_GNI.head()
#for df_GI
df_GI_v1=df_GI_v1[country_list]
df_GI_v1=df_GI_v1*1000000 #multiplying with 1000000 since unit of this dataset is in millions
print(df_GI_v1.shape)
df_GI_v1.head()
#for df_WDI
#Setting Country name as index
df_WDI=df_WDI.set_index(df_WDI.columns[0])
df_WDI=(pd.DataFrame(df_WDI.loc[df_WDI.index.isin(country_list)])).T
print(df_WDI.shape)
# converting datatype of index to numbers
df_WDI.index=df_WDI.index.astype(int)
df_WDI.head()
Since the government expenditure dataset only contains data from 2000 onward, we perform the analysis from 2001 to 2017 (17 years).
In the function below, years and countries are provided dynamically by the user. This analysis uses the Government Expenditure (df_GI_v1) and World Development Indicators (df_WDI) datasets.
#How much the government has invested in agriculture, and the agriculture returns, for all chosen countries
def analysis1(start,end,country_list):
    #Years are kept as ints (inclusive of the end year) to match the integer indexes of df_WDI and df_GI_v1
    year=list(range(start,end+1))
    #Selecting data for the chosen years and countries
    df_WDI_v1=pd.DataFrame(df_WDI.loc[df_WDI.index.isin(year)])
    df_WDI_v1=df_WDI_v1[country_list]
    df_GI_v2=pd.DataFrame(df_GI_v1.loc[df_GI_v1.index.isin(year)])
    df_GI_v2=df_GI_v2[country_list]
    #Plotting investment as bars and returns as lines
    fig = go.Figure()
    for i in range(len(country_list)):
        fig.add_trace(go.Bar(name=country_list[i] + ' Government Investment on Agriculture', x=df_GI_v2.index, y=df_GI_v2[country_list[i]]))
        fig.add_trace(go.Scatter(name=country_list[i] + ' Agriculture Returns', x=df_WDI_v1.index, y=df_WDI_v1[country_list[i]]))
    # Change the bar mode
    fig.update_layout(barmode='group',title_text='Government investment on agriculture and its returns',autosize=True,width=1000,height=600,
                      xaxis = dict(tickmode = 'linear'))
    return fig
In the function below, years and countries are provided dynamically by the user. This analysis uses the Government Expenditure (df_GI_v1) and World Development Indicators (df_WDI) datasets.
#Top 3 countries with the highest investment in agriculture, and their returns
def analysis2(start_A2,end_A2,countries_A2):
    list_year_A2=list(range(start_A2,end_A2+1))
    print(list_year_A2)
    #Finding the positions of the chosen years in the government expenditure table
    list_year_A2_GI_index=[]
    for i in range(len(list_year_A2)):
        list_year_A2_GI_index.append(list(df_GI_v1.index).index(list_year_A2[i]))
    #Finding the positions of the chosen years in the agriculture returns table
    list_year_A2_WDI_index=[]
    for i in range(len(list_year_A2)):
        list_year_A2_WDI_index.append(list(df_WDI.index).index(list_year_A2[i]))
    #Collecting the expenditure rows for the chosen years and countries
    temp_A2=[]
    for i in range(len(list_year_A2_GI_index)):
        temp_A2.append(pd.DataFrame(df_GI_v1[countries_A2].iloc[list_year_A2_GI_index[i]]))
    df_temp_A2=pd.DataFrame(index=countries_A2)
    for i in range(len(temp_A2)):
        df_temp_A2[(list(temp_A2[i].columns))[0]]=list(temp_A2[i][temp_A2[i].columns[0]])
    df_temp_A2=df_temp_A2.T
    #Selecting the 3 countries with the highest maximum investment
    temp=pd.DataFrame(df_temp_A2.max())
    temp.columns=['Max']
    temp=temp.sort_values('Max',ascending=False)
    temp=temp.head(3)
    list_GI_country=list(temp.index)
    #Creating a draft dataframe for the analysis
    df_GI_WDI_columns=[list_GI_country[0]+'_Govt_invest',
                       list_GI_country[0]+'_Agri_return',
                       list_GI_country[1]+'_Govt_invest',
                       list_GI_country[1]+'_Agri_return',
                       list_GI_country[2]+'_Govt_invest',
                       list_GI_country[2]+'_Agri_return',
                       ]
    df_GI_WDI=pd.DataFrame(columns=df_GI_WDI_columns,index=df_temp_A2.index)
    #Assigning values to the dataframe
    for i in range(len(list_GI_country)):
        df_GI_WDI[list_GI_country[i]+'_Govt_invest']=list(df_temp_A2[list_GI_country[i]])
        for j in range(len(list_year_A2_WDI_index)):
            df_GI_WDI[list_GI_country[i]+'_Agri_return'].iloc[j]=df_WDI[list_GI_country[i]].iloc[list_year_A2_WDI_index[j]]
    #Plotting the analysis: one subplot row per country (investment as bars, returns as a line)
    fig = make_subplots(rows=3, cols=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[0] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[0]]),row=1, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[1] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[1]]),row=1,col=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[2] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[2]]),row=2, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[3] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[3]]),row=2,col=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[4] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[4]]),row=3, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[5] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[5]]),row=3,col=2)
    fig.update_layout(height=1000, width=1500, title_text="Government Investment on Agriculture and its returns")
    return fig
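The top-3 selection used inside analysis2 (sort_values followed by head(3)) is equivalent to pandas' nlargest; a minimal sketch on made-up per-country maxima:

```python
import pandas as pd

# Made-up per-country maxima, as produced by df_temp_A2.max() in analysis2.
maxima = pd.Series({"A": 5.0, "B": 9.0, "C": 1.0, "D": 7.0})

# nlargest(3) is equivalent to sort_values(ascending=False).head(3)
top3 = maxima.nlargest(3)
print(list(top3.index))
```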
In the function below, years and countries are provided dynamically by the user. This analysis uses the World Development Indicators (df_WDI) dataset.
#Top 3 countries with the highest returns in the agriculture sector, and in which year
def analysis3(start_A3,end_A3,countries_A3):
    list_year_A3=list(range(start_A3,end_A3+1))
    #Finding the positions of the chosen years in the agriculture returns table
    list_year_A3_WDI_index=[]
    for i in range(len(list_year_A3)):
        list_year_A3_WDI_index.append(list(df_WDI.index).index(list_year_A3[i]))
    temp_A3=[]
    for i in range(len(list_year_A3_WDI_index)):
        temp_A3.append(pd.DataFrame(df_WDI[countries_A3].iloc[list_year_A3_WDI_index[i]]))
    # Creating a draft dataframe
    df_temp_A3=pd.DataFrame(index=countries_A3)
    for i in range(len(temp_A3)):
        df_temp_A3[(list(temp_A3[i].columns))[0]]=list(temp_A3[i][temp_A3[i].columns[0]])
    df_temp_A3=df_temp_A3.T
    #Selecting the 3 countries with the highest maximum returns
    df_temp_A3_v1=pd.DataFrame(df_temp_A3.max())
    df_temp_A3_v1.columns=['Max']
    df_temp_A3_v1=df_temp_A3_v1.sort_values('Max',ascending=False)
    df_temp_A3_v1=df_temp_A3_v1.head(3)
    #Creating a draft dataframe for the analysis
    df_WDI_A3=pd.DataFrame(columns=df_temp_A3_v1.index,index=df_temp_A3.index)
    for i in range(df_WDI_A3.shape[1]):
        df_WDI_A3[df_WDI_A3.columns[i]]=list(df_temp_A3[df_WDI_A3.columns[i]])
    #Computing the total agriculture returns per year across all countries in df_WDI
    list_total_agriculture=[]
    for i in range(len(list_year_A3_WDI_index)):
        list_total_agriculture.append(sum([float(v) for v in list(df_WDI.iloc[list_year_A3_WDI_index[i]])]))
    #Assigning the totals to a new column 'Total_Agri' in df_WDI_A3
    df_WDI_A3['Total_Agri']=list_total_agriculture
    #Computing what percentage of the total agriculture returns each of the top 3 countries contributed
    for i in range(len(list(df_temp_A3_v1.index))):
        df_WDI_A3[list(df_temp_A3_v1.index)[i]+'_percentage']=round((df_WDI_A3[list(df_temp_A3_v1.index)[i]]/df_WDI_A3['Total_Agri'])*100,2)
    #Plotting a stacked bar chart in plotly
    x = df_WDI_A3
    fig = go.Figure()
    for i in range(len(list(df_temp_A3_v1.index))):
        fig.add_trace(go.Bar(x=list(x.index), y=list(x[x.columns[i]]) ,name=x.columns[i], text=[s + '%' for s in list(df_WDI_A3[list(df_temp_A3_v1.index)[i]+'_percentage'].astype(str))] ,textposition='auto'))
    fig.update_layout(barmode='relative', title_text='Percentage of total agriculture returns by Top 3 countries',autosize=True)
    return fig
In the function below, years and countries are provided dynamically by the user. This analysis uses the Gross National Income (df_GNI) and World Development Indicators (df_WDI) datasets.
#Contribution (in percentage) of agriculture to each country's Gross National Income
def analysis4(start_A4,end_A4,countries_A4):
    list_year_A4=list(range(start_A4,end_A4+1))
    #Finding the positions of the chosen years in the GNI table
    list_year_A4_GNI_index=[]
    for i in range(len(list_year_A4)):
        list_year_A4_GNI_index.append(list(df_GNI.index).index(list_year_A4[i]))
    #Finding the positions of the chosen years in the agriculture returns table
    list_year_A4_WDI_index=[]
    for i in range(len(list_year_A4)):
        list_year_A4_WDI_index.append(list(df_WDI.index).index(list_year_A4[i]))
    #Collecting the data in a new dataframe
    temp_A4=[]
    for i in range(len(list_year_A4_WDI_index)):
        temp_A4.append(pd.DataFrame(df_WDI[countries_A4].iloc[list_year_A4_WDI_index[i]]))
    df_temp_A4=pd.DataFrame(index=countries_A4)
    for i in range(len(temp_A4)):
        df_temp_A4[(list(temp_A4[i].columns))[0]]=list(temp_A4[i][temp_A4[i].columns[0]])
    df_temp_A4=df_temp_A4.T
    #Creating a year column for plotting purposes
    df_temp_A4['Year']=list(df_temp_A4.index)
    #Computing each country's agriculture returns as a percentage of its GNI
    for i in range(len(countries_A4)):
        df_temp_A4[countries_A4[i]+'_%_in_GNI']=list(round(((df_temp_A4[countries_A4[i]])/(df_GNI[countries_A4[i]].iloc[list_year_A4_GNI_index]))*100,2))
    #Building the figures
    fig = go.Figure()
    table_values=[]
    #Converting the dataframe values to a list of lists for the plotly table
    for i in range((df_temp_A4[df_temp_A4.columns[len(countries_A4):]]).shape[1]):
        table_values.append(((df_temp_A4[df_temp_A4.columns[len(countries_A4):]])[(df_temp_A4[df_temp_A4.columns[len(countries_A4):]]).columns[i]]).values.tolist())
    fig1 = go.Figure(data=[go.Table(header=dict(values=df_temp_A4.columns[len(countries_A4):]),
                                    cells=dict(values=(table_values))
                                    )])
    fig1.update_layout(title_text="Table for percentage of GNI contributed by each country's agriculture returns",autosize=True,width=1000,height=600)
    for i in range(len(countries_A4)):
        fig.add_trace(go.Scatter(name=countries_A4[i] + ' Agriculture Returns', x=df_temp_A4.index, y=df_temp_A4[countries_A4[i]],text=[s + '%' for s in list(df_temp_A4[countries_A4[i]+'_%_in_GNI'].astype(str))]))
    # Change the bar mode
    fig.update_layout(barmode='group',title_text="Percentage of GNI contributed by each country's agriculture returns",autosize=True,width=1000,height=600,
                      xaxis = dict(tickmode = 'linear'))
    return fig1,fig
def start_year():
    try:
        s_year=int(input('Enter start year from 2001 to 2017 -'))
        if s_year in list(range(2001,2018)):
            return s_year
        else:
            print('Out of range. Please enter correct year ')
            return start_year()    #retry, returning the recursive result
    except ValueError:
        print("Error! This is not a number. Try again.")
        return start_year()
def end_year():
    try:
        e_year=int(input('Enter end year from 2001 to 2017 -'))
        if e_year in list(range(2001,2018)):
            return e_year
        else:
            print('Out of range. Please enter correct year ')
            return end_year()
    except ValueError:
        print("Error! This is not a number. Try again.")
        return end_year()
def number_of_countries():
    try:
        num=int(input('Enter number of countries you want to check -'))
        return num
    except ValueError:
        print("Error! This is not a number. Try again.")
        return number_of_countries()
The function 'country_name_clean' below cleans the country names entered by the user.
def country_name_clean(country_name):
    #Removing white spaces
    country_name=country_name.replace(' ','')
    #Removing punctuation
    country_name=country_name.translate(str.maketrans('', '', string.punctuation))
    #Capitalizing the first letter of the string
    country_name=country_name.capitalize()
    return country_name
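For example, the cleaner maps ' United States! ' to 'Unitedstates' (capitalize() lowercases everything after the first character, so multi-word names collapse into a single capitalized token, matching the cleaned column names built earlier). A standalone sketch of the same steps, under a hypothetical name to keep it self-contained:

```python
import string

# Same cleaning steps as country_name_clean above, shown standalone.
def clean_name(country_name):
    country_name = country_name.replace(' ', '')                                      # drop spaces
    country_name = country_name.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    return country_name.capitalize()                                                  # first letter upper, rest lower

print(clean_name(' United States! '))  # -> 'Unitedstates'
```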
def country_list():
    countries=[]
    num=int(number_of_countries())
    for i in range(num):
        country_name=str(input('Enter country name -'))
        if not country_name:
            print('Enter correct country name')
            return country_list()    #retry the whole list, returning the recursive result
        country_name=country_name_clean(country_name)
        if country_name in df_WDI.columns:
            countries.append(country_name)
        else:
            print('Country name not found. Please try again.')
            return country_list()
    return countries,num
def merger(output_path, input_paths):
    pdf_writer = PdfFileWriter()
    for path in input_paths:
        pdf_reader = PdfFileReader(path, strict=False)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)
def main():
    choice ='0'
    while choice =='0':
        start= start_year()
        end= end_year()
        if end < start:
            print('End year should be greater than start year')
            end= end_year()
        countries,count= country_list()
        def choose():
            print("\nMain Analysis menu: Choose 1 of 5 choices\n")
            print("\n1. How much the government has invested in agriculture, and the agriculture returns, for all chosen countries")
            print("2. Top 3 countries with the highest investment in agriculture, and their returns")
            print("3. Top 3 countries with the highest returns in the agriculture sector, and in which year")
            print("4. How much agriculture and forestry contribute (in percentage) to the Gross National Income of the country")
            print("5. Exit and export the report in PDF format")
            choice=str(input("Please make a choice:"))
            if choice == "5":
                print("\n\nThank you")
                print("Please open results.pdf")
                #Merging all generated analysis PDFs into results.pdf
                paths = glob.glob('*.pdf')
                paths.sort()
                merger('results.pdf', paths)
            elif choice == "4":
                fig_analysis41,fig_analysis42=analysis4(start,end,countries)
                fig_analysis41.show()
                fig_analysis42.show()
                fig_analysis41.write_image('analysis4_1.pdf',width=1200,height=600)
                fig_analysis42.write_image('analysis4_2.pdf',width=1200,height=600)
                choose()
            elif choice == "3":
                if count > 3:
                    fig_analysis3=analysis3(start,end,countries)
                    fig_analysis3.show()
                    fig_analysis3.write_image('analysis3.pdf',width=1200,height=600)
                    choose()
                else:
                    print("This analysis is valid only when more than 3 countries are chosen")
                    choose()
            elif choice == "2":
                if count > 3:
                    fig_analysis2=analysis2(start,end,countries)
                    fig_analysis2.show()
                    fig_analysis2.write_image('analysis2.pdf',width=1200,height=600)
                    choose()
                else:
                    print("This analysis is valid only when more than 3 countries are chosen")
                    choose()
            elif choice == "1":
                fig_analysis1=analysis1(start,end,countries)
                fig_analysis1.show()
                fig_analysis1.write_image('analysis1.pdf',width=1200,height=600)
                choose()
            else:
                print("I don't understand your choice.")
                choose()
            return
        choose()
        choice='1'
    return
For executing all the analyses, I have created a menu-driven user interface where the user can run the analysis for selected years; the user must enter a start year and an end year between 2001 and 2017.
The interface also lets the user choose the countries to analyze, for example 'Australia', 'Austria', 'Argentina', 'Denmark', 'Belarus', 'Nigeria', 'Egypt'.
After the analyses, one menu option exports all the results in PDF format ('results.pdf') so they can be reviewed or compared later.
Below are screenshots showing a simple example run of all the analyses.
main()
Does agriculture make an important contribution to the country's Gross National Income in that year? (Prediction: Yes/No)
Are the government's investments in the agriculture and forestry sector profitable? (Prediction: Yes/No)
Since the dataset is small, I use a k-nearest neighbors (k-NN) classification model for both hypotheses, as it gives better accuracy than the other models I tried.
k-nearest neighbors algorithm (k-NN) -
It is a supervised machine learning algorithm that relies on labeled input data to learn a function that produces an appropriate output when given new, unlabeled data.
This prediction is per country, based on that country's data over the years, so we prepare a model using Argentina's data.
#Additional libraries to create model
from sklearn.preprocessing import LabelEncoder #label encoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
# creating dataframe
df_hypothesis1=pd.DataFrame(index=df_GNI.index)
df_hypothesis1['Agriculture_returns']=df_WDI['Argentina'].astype(float)
df_hypothesis1['GNI']=df_GNI['Argentina']
df_hypothesis1['%_in_GNI']=round(((df_hypothesis1['Agriculture_returns']/df_hypothesis1['GNI'])*100),2)
df_hypothesis1['Target']=""
# Here the benchmark is 7.5%: if %_in_GNI is greater than 7.5%, then 'Yes', agriculture plays an important role in GNI; else 'No'
for i in range(df_hypothesis1.shape[0]):
    if df_hypothesis1['%_in_GNI'].iloc[i] > 7.5:
        df_hypothesis1['Target'].iloc[i]='Yes'
    else:
        df_hypothesis1['Target'].iloc[i]='No'
print(df_hypothesis1.shape)
df_hypothesis1.head()
# Separating features and target
features = df_hypothesis1[df_hypothesis1.columns[range(0,3)]]
target= df_hypothesis1.iloc[:,-1]
# labeling target
le = LabelEncoder()
target=list(le.fit_transform(target))
print(target)
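LabelEncoder assigns integer codes by sorted label order, which is why 'No' maps to 0 and 'Yes' to 1 here; a stdlib sketch of the same mapping:

```python
# Stdlib sketch of what LabelEncoder does: classes_ is the sorted set of
# labels, and each label is encoded as its index in that sorted order.
labels = ["Yes", "No", "No", "Yes"]
classes = sorted(set(labels))                 # ['No', 'Yes']
codes = [classes.index(x) for x in labels]    # 'No' -> 0, 'Yes' -> 1
print(codes)
```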
#Scaling features
# Load the standard scaler
sc = StandardScaler()
# Scale the feature data to be of mean 0 and of unit variance
features_scaled = sc.fit_transform(features)
print("features shape",features.shape)
#Creating df_feature_scaled to use it in plot
df_features_scaled=pd.DataFrame(features_scaled, columns=features.columns)
print("features_scaled shape",df_features_scaled.shape)
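StandardScaler transforms each feature to z = (x − mean) / std, using the population standard deviation; a stdlib sketch of the same computation for a single toy column:

```python
# Stdlib sketch of standard scaling for one feature column:
# subtract the column mean, divide by the population standard deviation.
def standard_scale(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n   # population variance, as StandardScaler uses
    return [(x - mean) / var ** 0.5 for x in column]

scaled = standard_scale([2.0, 4.0, 6.0])
print(scaled)
```

After scaling, the column has mean 0 and unit variance, matching the fit_transform output above.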
#Plotting the difference between the scaled and unscaled training data
print("*Standard scaling removes the mean and scales the data to unit variance, as we can see in the plot.\n*However, since the outliers of each feature have different magnitudes, the features do not end up on exactly the same scale.\n*StandardScaler does not guarantee perfectly balanced feature scaling.")
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10))
ax1.set_title("Before Scaling")
for i in range(features.shape[1]):
    sns.kdeplot(features[features.columns[i]],ax=ax1)
ax2.set_title("After Standard Scaling")
for i in range(df_features_scaled.shape[1]):
    sns.kdeplot(df_features_scaled[df_features_scaled.columns[i]],ax=ax2)
# Splitting dataset
X_train, X_test, Y_train, Y_test = train_test_split(features_scaled,target,test_size=0.3,random_state=32)
#Creating X_train and X_test dataframes
df_X_train = pd.DataFrame(X_train)
df_X_test = pd.DataFrame(X_test)
print(df_X_train.shape)
print(df_X_test.shape)
#Setting parameters
knn_range = list(range(1,26))
#Checking the k value using GridSearchCV
param_knn = dict(n_neighbors=knn_range)
# Using GridSearchCV to find best parameters for KNN classifier model
grid_knnclf = GridSearchCV(KNeighborsClassifier(), param_knn, cv=10, scoring='accuracy')
grid_knnclf.fit(X_train, Y_train)
print(grid_knnclf.best_score_)
knn_best_param=grid_knnclf.best_params_
print(knn_best_param)
#fitting model
knn_model = KNeighborsClassifier(n_neighbors= knn_best_param['n_neighbors'])
knn_model.fit(X_train, Y_train)
# knn predictions
knn_predictions= knn_model.predict(X_test)
#Creating confusion matrix
lr_cm = confusion_matrix(Y_test, knn_predictions)
lr_cm=lr_cm[::-1,::-1]
print(lr_cm)
#Plotting confusion matrix
df_cm = pd.DataFrame(lr_cm, columns=np.unique(Y_test), index = np.unique(Y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
df_cm
#plt.figure(figsize = (7,5))
sns.set(font_scale=1)#for label size
ax=sns.heatmap(df_cm, cmap="Blues", annot=True, linewidths=.5, fmt='d')# font size
ax.invert_yaxis()
ax.invert_xaxis()
#Computing the four counts from the unreversed confusion matrix so the labels are correct
tn, fp, fn, tp = confusion_matrix(Y_test, knn_predictions).ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("========================================")
print("Confusion Matrix Interpretations:")
#Total number of predictions
print("Total number of predictions carried out by model:", tp+fp+fn+tn)
#Accuracy
Accuracy = (tn+tp)*100/(tp+tn+fp+fn)
print("1. Accuracy:", round(Accuracy,2), ", which means", round(Accuracy,2), "% of all the model's predictions are correct.")
#Recall
Recall = (tp)*100/(tp+fn)
print("2. Recall (Sensitivity):", round(Recall,2), ", which means the model correctly identifies", round(Recall,2), "% of the actual positive cases.")
#Precision
Precision = (tp)*100 / (tp + fp)
print("3. Precision (Exactness):", round(Precision,2), ", which means that when the model predicts a positive,", round(Precision,2), "% of those predictions are correct.")
#Specificity
Specificity = (tn) * 100/(tn+fp)
print("4. Specificity:", round(Specificity,2), ", which means the model correctly identifies", round(Specificity,2), "% of the actual negative cases.")
print("========================================")
print("Classification metrics:")
print(classification_report(Y_test,knn_predictions))
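The four interpretations printed above can be consolidated into one helper; a small sketch using hypothetical counts in the [[TP, FN], [FP, TN]] layout produced by the reversal above:

```python
# Compute the four confusion-matrix metrics (as percentages) from raw counts.
# The counts below are hypothetical, not the model's actual confusion matrix.
def cm_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "accuracy":    100 * (tp + tn) / total,  # correct predictions overall
        "recall":      100 * tp / (tp + fn),     # sensitivity: actual positives found
        "precision":   100 * tp / (tp + fp),     # exactness: positive predictions correct
        "specificity": 100 * tn / (tn + fp),     # actual negatives found
    }

m = cm_metrics(tp=8, fn=2, fp=1, tn=9)  # hypothetical counts
```

Wrapping the formulas like this keeps the two hypothesis sections from duplicating the same arithmetic.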
print(Y_test)
print(knn_predictions)
# implementing AUC
from sklearn.metrics import roc_curve,auc
y_scores = knn_model.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(Y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()
An AUC of 0.88 means the model can distinguish between the two classes 88% of the time.
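That reading of AUC has a precise form: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small sketch with hypothetical scores (not the model's output) verifying the equivalence:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc_val = roc_auc_score(y_true, scores)

# Fraction of (positive, negative) pairs ranked correctly (no ties here)
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])
```

Both quantities come out to 8/9, which is why an AUC of 0.88 can be read as "distinguishes the classes 88% of the time".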
#Evaluation scores for K-nearest neighbors classifier
print("Accuracy score for K-nearest neighbors Classifier-%.2f" %accuracy_score(Y_test,knn_predictions))
print("F1 score for K-nearest neighbors Classifier-%.2f" %f1_score(Y_test,knn_predictions))
print("Precision score for K-nearest neighbors Classifier-%.2f" %precision_score(Y_test,knn_predictions))
print("Recall score for K-nearest neighbors Classifier-%.2f" %recall_score(Y_test,knn_predictions))
Based on the accuracy (0.83), precision (0.8) and recall (0.8), the model performs well on the test set. That means this model can be used to predict whether agriculture returns make an important contribution to the gross national income of Argentina.
Similarly, models can be prepared for each country as required.
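The per-country idea above can be sketched as one reusable function. This is only a sketch: the function name, signature, and the synthetic data standing in for a country's indicators are assumptions, not the notebook's actual code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def fit_country_knn(features, target, test_size=0.3, random_state=32):
    """Scale features, tune k by cross-validation, and return the fitted
    model with its held-out accuracy (hypothetical helper)."""
    X = StandardScaler().fit_transform(features)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, target, test_size=test_size, random_state=random_state)
    grid = GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": list(range(1, 26))},
                        cv=5, scoring="accuracy")
    grid.fit(X_tr, y_tr)
    model = grid.best_estimator_
    return model, accuracy_score(y_te, model.predict(X_te))

# Synthetic stand-in for one country's engineered features and target
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)
model, acc = fit_country_knn(X_demo, y_demo)
```

With a helper like this, repeating the Argentina workflow for Australia (or any other country) reduces to one call per country's prepared features and target.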
# creating dataframe
df_hypothesis2=pd.DataFrame(index=df_WDI.index)
df_hypothesis2['Agriculture_returns']=df_WDI['Australia'].astype(float)
df_hypothesis2['GI']=np.nan
for i in range(df_GI_v1.shape[0]):
    df_hypothesis2['GI'].iloc[41+i] = df_GI_v1['Australia'].iloc[i]
df_hypothesis2['GI']=df_hypothesis2['GI'].fillna((df_hypothesis2['GI']).min())
df_hypothesis2['GI']=df_hypothesis2['GI'].astype(float)
df_hypothesis2['%_in_GNI']=round(((df_hypothesis2['Agriculture_returns']/df_hypothesis2['GI'])*100),2)
df_hypothesis2['Target']=np.nan
# Benchmark is 50%: if %_in_GNI is greater than 50%, agriculture plays an important role in GNI ('Yes'), otherwise 'No'
for i in range(df_hypothesis2.shape[0]):
    if df_hypothesis2['%_in_GNI'].iloc[i] > 50.0:
        df_hypothesis2['Target'].iloc[i] = 'Yes'
    else:
        df_hypothesis2['Target'].iloc[i] = 'No'
print(df_hypothesis2.shape)
df_hypothesis2.head()
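The row-by-row labelling used above can also be written as a single vectorized step, which is faster and avoids pandas chained-assignment warnings; a sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical %_in_GNI values standing in for the computed column
df_demo = pd.DataFrame({"%_in_GNI": [12.5, 63.0, 50.0, 71.4]})

# Same 50% benchmark as above, applied to the whole column at once
df_demo["Target"] = np.where(df_demo["%_in_GNI"] > 50.0, "Yes", "No")
```

`np.where` evaluates the condition over the entire column in one pass, so no explicit loop or per-row `.iloc` assignment is needed.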
# carrying out features and targets
features2 = df_hypothesis2[df_hypothesis2.columns[:3]]
target2= df_hypothesis2.iloc[:,-1]
# labeling target
le = LabelEncoder()
target2=list(le.fit_transform(target2))
print(target2)
#Scaling features
# Load the standard scaler
sc = StandardScaler()
# Scale the feature data to be of mean 0 and of unit variance
features2_scaled = sc.fit_transform(features2)
print("features shape",features2.shape)
#Creating df_feature_scaled to use it in plot
df_features2_scaled=pd.DataFrame(features2_scaled, columns=features2.columns)
print("features_scaled shape",df_features2_scaled.shape)
#Plotting the difference between scaled training data and without scaled data
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10))
ax1.set_title("Before Scaling")
for i in range(features2.shape[1]):
sns.kdeplot(features2[features2.columns[i]],ax=ax1)
ax2.set_title("After Standard Scaling")
for i in range(df_features2_scaled.shape[1]):
sns.kdeplot(df_features2_scaled[df_features2_scaled.columns[i]],ax=ax2)
# Splitting dataset
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(features2_scaled,target2,test_size=0.2, random_state=43)
#Creating X_train and X_test dataframes
df_X_train2 = pd.DataFrame(X_train2)
df_X_test2 = pd.DataFrame(X_test2)
print(df_X_train2.shape)
print(df_X_test2.shape)
# Using GridSearchCV to find best parameters for KNN classifier model
grid_knnclf2 = GridSearchCV(KNeighborsClassifier(), param_knn, cv=10, scoring='accuracy')
grid_knnclf2.fit(X_train2, Y_train2)
print(grid_knnclf2.best_score_)
knn_best_param2=grid_knnclf2.best_params_
print(knn_best_param2)
#fitting model
knn_model2 = KNeighborsClassifier(n_neighbors= knn_best_param2['n_neighbors'])
knn_model2.fit(X_train2, Y_train2)
# knn predictions
knn_predictions2= knn_model2.predict(X_test2)
#Creating confusion matrix (reversed on both axes so the layout is [[TP, FN], [FP, TN]])
lr_cm2 = confusion_matrix(Y_test2, knn_predictions2)
lr_cm2 = lr_cm2[::-1, ::-1]
print(lr_cm2)
#Plotting confusion matrix
df_cm2 = pd.DataFrame(lr_cm2, columns=np.unique(Y_test2)[::-1], index=np.unique(Y_test2)[::-1])  # labels reversed to match the reversed matrix
df_cm2.index.name = 'Actual'
df_cm2.columns.name = 'Predicted'
df_cm2
#plt.figure(figsize = (7,5))
sns.set(font_scale=1)#for label size
ax=sns.heatmap(df_cm2, cmap="Spectral", annot=True, linewidths=.5, fmt='d')# font size
ax.invert_yaxis()
ax.invert_xaxis()
# After the reversal above the layout is [[TP, FN], [FP, TN]]
tp2, fn2, fp2, tn2 = lr_cm2.ravel()
print("True Negatives: ",tn2)
print("False Positives: ",fp2)
print("False Negatives: ",fn2)
print("True Positives: ",tp2)
print("========================================")
print("Confusion Matrix Interpretations:")
#Total number of predictions
print("Total number of predictions carried out by model:", tp2+fp2+fn2+tn2)
#Accuracy
Accuracy2 = (tn2+tp2)*100/(tp2+tn2+fp2+fn2)
print("1. Accuracy:", round(Accuracy2,2), ", which means", round(Accuracy2,2), "% of all the model's predictions are correct.")
#Recall
Recall2 = (tp2)*100/(tp2+fn2)
print("2. Recall (Sensitivity):", round(Recall2,2), ", which means the model correctly identifies", round(Recall2,2), "% of the actual positive cases.")
#Precision
Precision2 = (tp2)*100 / (tp2 + fp2)
print("3. Precision (Exactness):", round(Precision2,2), ", which means that when the model predicts a positive,", round(Precision2,2), "% of those predictions are correct.")
#Specificity
Specificity2 = (tn2) * 100/(tn2+fp2)
print("4. Specificity:", round(Specificity2,2), ", which means the model correctly identifies", round(Specificity2,2), "% of the actual negative cases.")
print("========================================")
print("Classification metrics:")
print(classification_report(Y_test2,knn_predictions2))
# implementing AUC
from sklearn.metrics import roc_curve,auc
y_scores2 = knn_model2.predict_proba(X_test2)
fpr2, tpr2, threshold2 = roc_curve(Y_test2, y_scores2[:, 1])
roc_auc2 = auc(fpr2, tpr2)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr2, tpr2, 'b', label = 'AUC = %0.2f' % roc_auc2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()
An AUC of 1.0 means the model separates the two classes perfectly on this test set.
print(Y_test2)
print(knn_predictions2)
#Evaluation scores for K-nearest neighbors classifier
print("Accuracy score for K-nearest neighbors Classifier-%.2f" %accuracy_score(Y_test2,knn_predictions2))
print("F1 score for K-nearest neighbors Classifier-%.2f" %f1_score(Y_test2,knn_predictions2))
print("Precision score for K-nearest neighbors Classifier-%.2f" %precision_score(Y_test2,knn_predictions2))
print("Recall score for K-nearest neighbors Classifier-%.2f" %recall_score(Y_test2,knn_predictions2))